#install.packages("nycflights13")
library(nycflights13)
library(tidyverse)Slides: Data Manipulation and Transformation
Actuarial Data Science Online Textbook
Reading List
- R for Data Science Online Book, Chapters 5, 13
- Applied predictive Modelling, 3.3 (only transformations to resolve outliers), 3.4
Introduction
Motivation
- It is rare that you get the data in exactly the right form you need.
- You’ll need to create some new variables or summaries, or
- you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.
Using R to manipulate data
- R package:
dplyr(a core member oftidyverse) for data manipulation and transformation1 - Data:
nycflights13package, flights departing New York City in 2013 - We use
ggplot2to help us understand the data.
Data
flights #tibble, tweaked data frame to work better in tidyverseview(flights) # will open the dataset in the RStudio viewerTypes of variables
intstands for integers.dblstands for doubles, or real numbers.chrstands for character vectors, or strings.dttmstands for date-times (a date + a time).lglstands for logical, vectors that contain only TRUE or FALSE.fctrstands for factors, which R uses to represent categorical variables with fixed possible values.datestands for dates.
Data Manipulation Functions
Functions for data manipulation
Functions in dplyr package
%>%Pipe operatorglimpse()A glimpse into the data and its structurefilter()Pick observations by their values
arrange()Reorder the rows
select()Pick variables by their names
summarise()Collapse many values down to a single
group_by()Changes the scope of each function above from operating on the entire dataset to operating on it group-by-group
Filter: Introduction
#The first argument is the name of the data frame.
#The subsequent arguments are the expressions that filter the data frame.
filter(flights, month == 1, day == 1)# use the assignment operator, <- to save the result
#jan1 <- filter(flights, month == 1, day == 1)
# Save and print the result at the same time
#(dec25 <- filter(flights, month == 12, day == 25))Filter: Comparisons
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite:
>bigger than
>=bigger than or equal to<less than
<=less than or equal to!=not equal==equal2
Filter: Logical Operators
- Multiple arguments to
filter()are combined with “and”: every expression must be true in order for a row to be included in the output.
Other types of combinations using Boolean operators:
&is “and”|is “or”!is “not”x %in% yselect every row wherexis one of the values iny
According to De Morgan’s law,
!(x & y)is the same as!x | !y!(x | y)is the same as!x & !y
Exercise 1
- Find all flights that departed in November or December.
filter(flights, month %in% c(11, 12))#Alternatively, use the following
#filter(flights, month == 11 | month == 12)Exercise 2
- Find flights that weren’t delayed (on arrival or departure) by more than two hours.
filter(flights, !(arr_delay > 120 | dep_delay > 120))#Alternatively, use the following
#filter(flights, arr_delay <= 120, dep_delay <= 120)Filter: Missing Values
NArepresents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.filter()only includes rows where the condition isTRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)filter(df, is.na(x) | x > 1)Arrange
arrange(): Order by column names (or more complicated expressions)- Use
desc()to re-order by a column in descending order - Missing values are always sorted at the end
arrange(flights, year)#arrange(flights, desc(dep_delay))Select
select()allows you to select a useful subset based on the names of the variables.
select(flights, year, month, day)#select(flights, year:day)
#select(flights, -(year:day))Select: Useful functions
starts_with("abc"): matches names that begin with “abc”.ends_with("xyz"): matches names that end with “xyz”.contains("ijk"): matches names that contain “ijk”.matches(): selects variables that match a regular expression.everything()
Other related functions:
rename(): rename variablesmutate(): create new variables with functions of existing variables
Summarise
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))Summarise: with group-by
- This changes the unit of analysis from the complete dataset to individual groups.
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
Summarise: Exercise with multiple operations
- Explore the relationship between the distance and average delay for each location
- There are three steps to prepare this data:
- Group flights by destination
- Summarise to compute distance, average delay, and number of flights
- Filter to remove noisy points (counts below or equal to 20) and Honolulu (“HNL”) airport, which is almost twice as far away as the next closest airport
Please have a try!
Exercise: code
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
plot=ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'Exercise: plot
print(plot)
Exercise: Another way to do it using pipe %>%
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")Missing values
na.rm=TRUEremoves the missing values
flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE))Useful Summary functions
- Measures of location (central tendency):
mean(x),median(x) - Measures of spread (variability):
sd(x),IQR(x)3 - Measures of rank:
min(x),quantile(x, 0.25),max(x) - Measures of position:
first(x),nth(x, 2),last(x) - Counts:
n(),sum(!is.na(x)),n_distinct(x)count(tailnum, wt = distance), “count” (sum) the total number of miles a plane flew
- Counts and proportions of logical values:
sum(x > 10),mean(y == 0)
Grouping by multiple variables
daily <- group_by(flights, year, month, day)
#n() returns the size of the current group
(per_day <- summarise(daily, flights = n()))`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
Ungrouping
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flightsRelational Data
Relational Data
- Relational data: multiple tables of data that are related.
Three families of verbs designed to work with relational data:
Mutating joins: add new variables to one data frame from matching observations in another.
Filtering joins: filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations: treat observations as if they were set elements.
Other similar database system: SQL
Data set
nycflights13contains five tibbles that are related to each other:flights: gives information about each flightairlines: lets you look up the full carrier nameairports: gives information about each airport, identified by thefaaairport codeplanes: gives information about each plane, identified by itstailnumweather: gives the weather at each NYC airport for each hour
Table Relations
- Each relation always concerns a pair of tables
- Understand the chain of relations between the tables that you are interested in.

Keys
- Primary key: uniquely identifies an observation in its own table.
- For example,
planes$tailnumis a primary key because it uniquely identifies each plane in the planes table.
- For example,
- Foreign key: uniquely identifies an observation in another table.
- For example,
flights$tailnumis a foreign key because it appears in the flights table where it matches each flight to a unique plane.
- For example,
- A variable can be both a primary key and a foreign key. For example, origin is part of the weather primary key, and is also a foreign key for the airport table.
Identify the primary keys
count()the primary keys and look for entries wherenis greater than one.
planes %>%
count(tailnum) %>%
filter(n > 1)Add a primary key
- What’s the primary key in the
flightstable?- None
- Surrogate key:add one primary key with
mutate()androw_number() - A primary key and the corresponding foreign key in another table form a relation.
- Relations are typically one-to-many. You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation.
Exercise: Add a surrogate key to flights
flights %>%
arrange(year, month, day, sched_dep_time, carrier, flight) %>%
mutate(flight_id = row_number()) %>%
#TThis makes it possible to see every column in a data frame.
glimpse()Rows: 336,776
Columns: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 559, 558, 559, 558, 558, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 558, 559, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -4, 0, -2, -1, -2, -2, -3, NA, 1, 0, -5, -…
$ arr_time <int> 830, 850, 923, 1004, 740, 702, 753, 941, 849, 853, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 728, 706, 745, 910, 851, 856, 846,…
$ arr_delay <dbl> 11, 20, 33, -18, 12, -4, 8, 31, -2, -3, -8, NA, -6, -7,…
$ carrier <chr> "UA", "UA", "AA", "B6", "UA", "B6", "AA", "AA", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 1696, 1806, 301, 707, 49, 71, 79…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N39463", "N708…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "EWR", "JFK", "LGA", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ORD", "BOS", "ORD", "DFW",…
$ air_time <dbl> 227, 227, 160, 183, 150, 44, 138, 257, 149, 158, 140, N…
$ distance <dbl> 1400, 1416, 1089, 1576, 719, 187, 733, 1389, 1028, 1005…
$ hour <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 58, 59, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
$ flight_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
Mutating Joins
- Mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
- Add columns from y to x:
inner_join(): keeps observations that appear in both tables.left_join(): keeps all observations in x.right_join(): keeps all observations in y.full_join(): keeps all observations in x and y.
flights %>%
select(year:day, hour, tailnum, carrier) %>%
left_join(airlines, by = "carrier") #by = "key"Mutating Joins: The Key Columns
- When you join duplicated keys, you get all possible combinations.
- Defining the key columns
by=NULL: uses all variables that appear in both tables, the so called natural joinby = "x": uses only some of the common variables.by = c("a" = "b"): match variable a in table x to variable b in table y
Mutating Joins: base::merge()
| dplyr | merge |
|---|---|
| inner_join(x, y) | merge(x, y) |
| left_join(x, y) | merge(x, y, all.x = TRUE) |
| right_join(x, y) | merge(x, y, all.y = TRUE) |
| full_join(x, y) | merge(x, y, all.x = TRUE, all.y = TRUE) |
- dplyr’s joins are considerably faster and don’t mess with the order of the rows.
Filtering Joins
- Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.
- There are two types:
- semi_join(x, y) keeps all observations in x that have a match in y.
- anti_join(x, y) drops all observations in x that have a match in y.
Exercise 1
- Questions:
- find the top ten most popular destinations
- match it back to
flights
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
flights %>%
semi_join(top_dest)Joining with `by = join_by(dest)`
Exercise 2
- Question: when connecting flights and planes, what are the flights that don’t have a match in planes?
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)Join problems
- Start by identifying the variables that form the primary key in each table.
- Check that none of the variables in the primary key are missing. If a value is missing then it can’t identify an observation!
- Check that your foreign keys match primary keys in another table. The best way to do this is with an anti_join().
Set Operations
intersect(x, y): return only observations in both x and y.union(x, y): return unique observations in x and y.setdiff(x, y): return observations in x, but not in y.
Examples
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
intersect(df1, df2)union(df1, df2)setdiff(df1, df2) #setdiff(df2, df1)Footnotes
dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag()↩︎
Be cautions when using
==: floating point numbers. Consider usingnear()↩︎The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Q1 is the “middle” value in the first half of the rank-ordered data set. Q2 is the median value in the set. Q3 is the “middle” value in the second half of the rank-ordered data set. The interquartile range is equal to Q3 minus Q1.↩︎